In this notebook, we will filter and normalize the cells in the SingleCellExperiment (SCE) object obtained from the “1-demultiplex.Rmd” notebook. Hence, we aim to obtain a ready-to-analyze SCE object that does not contain poor-quality cells (i.e. broken or stressed cells) and with its counts normalized to correct for technical artifacts.
library(Matrix)
library(stringr)
library(psych)
library(kmed)
library(pheatmap)
library(fitdistrplus)
library(ggpubr)
library(SingleCellExperiment)
library(scater)
library(scran)
library(SC3)
library(grid)
library(purrr)
library(gridExtra)
library(gridGraphics)
library(Seurat)
library(tidyverse)
source("bin/utils.R")
To calculate the cell quality control metrics, we will use the calculateQCMetrics function from the scater package, which computes a series of QC metrics for each cell (such as library size or number of detected genes), and stores them as new variables in the column metadata of the SingleCellExperiment object (colData). We start by loading the demultiplexed SingleCellExperiment object:
date <- Sys.Date()
# Load demultiplexed SingleCellExperiment object
sce <- readRDS("results/R_objects/SCE_demultiplexed.RDS")
# Filter out unassigned cells
sce <- sce[, sce$condition != "unassigned"]
# Define mitochondrial genes as internal controls
mt_symb <- str_subset(rowData(sce)$name, "^MT-")
mt_ensembl <- rowData(sce)[rowData(sce)$name %in% mt_symb, "id"]
isSpike(sce, "MT") <- rownames(sce) %in% mt_ensembl
# Calculate QC metrics
sce <- calculateQCMetrics(
sce,
feature_controls = list(MT = isSpike(sce, "MT"))
)
sce
## class: SingleCellExperiment
## dim: 20719 11752
## metadata(0):
## assays(1): counts
## rownames(20719): ENSG00000238009 ENSG00000237683 ... ENSG00000215700 ENSG00000215699
## rowData names(11): id name ... total_counts log10_total_counts
## colnames(11752): AAACCTGAGATCACGG-1.1 AAACCTGAGATCTGCT-1 ... TTTGTCAGTTGGTAAA-1 TTTGTCATCAAACAAG-1
## colData names(39): batch donor ... pct_counts_in_top_200_features_MT pct_counts_in_top_500_features_MT
## reducedDimNames(0):
## spikeNames(1): MT
head(colnames(colData(sce)), 10)
## [1] "batch" "donor" "condition" "is_cell_control" "total_features_by_counts" "log10_total_features_by_counts" "total_counts" "log10_total_counts" "pct_counts_in_top_50_features" "pct_counts_in_top_100_features"
We first filter out cells that have a library size (total number of RNA molecules) too small in comparison with other cells. Such cells are likely to have broken or failed to capture. To determine the threshold, we can visualize the library size distribution with a histogram. As there are outliers with a great deal of counts, we will plot the log distribution:
x_titl <- expression("log"[10]*"(library size)")
lib_size_qc <- as.data.frame(colData(sce)) %>%
mutate(exclude = ifelse(log10(total_counts) < 2.85 | log10(total_counts) > 4, TRUE, FALSE)) %>%
ggplot(aes(log10(total_counts), fill = exclude, color = exclude)) +
geom_histogram(bins = 100, alpha = 0.65) +
geom_vline(xintercept = 2.85, color = "red", linetype = "dashed") +
geom_vline(xintercept = 4, color = "red", linetype = "dashed") +
scale_x_continuous(x_titl) +
scale_y_continuous(expand = c(0,0)) +
scale_color_manual(values = c("black", "red2")) +
scale_fill_manual(values = c("black", "red2")) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
lib_size_qc
Based on the log distribution, we remove those cells with a library size lower than 10^2.85 = 707 UMI. These cells are likely cellular debris present in empty droplets. Moreover, we also filter cells with > 10,000 UMI, which are likely doublets. Notice that we are using data-driven filters which are based on the comparison between cells:
table(sce$total_counts > 707 & sce$total_counts < 10000)
##
## FALSE TRUE
## 272 11480
keep_lib_size <- sce$total_counts > 707
We next filter by the cell coverage, which is the number of detected genes in each cell (i.e., number of genes with non-zero counts for a given cell). We want to ensure that the reads are distributed across the transcriptome. Thus, we rule out those cells that have an abnormally low number of detected genes.
cell_coverage_hist <- as.data.frame(colData(sce)) %>%
mutate(exclude = ifelse(total_features_by_counts < 350, TRUE, FALSE)) %>%
ggplot(aes(total_features_by_counts, fill = exclude, color = exclude)) +
geom_histogram(bins = 100, alpha = 0.65) +
geom_vline(xintercept = 350, color = "red", linetype = "dashed") +
scale_x_continuous("Number of detected genes") +
scale_y_continuous(expand = c(0,0)) +
scale_color_manual(values = c("black", "red2")) +
scale_fill_manual(values = c("black", "red2")) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
library_quality <- ifelse(sce$total_features_by_counts < 350, TRUE, FALSE)
sce$exclude <- library_quality
cumul_dis <- plotScater(
sce,
nfeatures = 300,
colour_by = "exclude",
exprs_values = "counts"
)
cumul_dis <- cumul_dis +
scale_color_manual(values = c("black", "red2")) +
theme_bw() +
theme(panel.grid = element_blank())
cell_coverage_qc <- ggarrange(
plotlist = list(cell_coverage_hist, cumul_dis),
nrow = 1,
ncol = 2
)
cell_coverage_qc
According to the distribution, we remove those cells with a cell coverage lower than 350 detected genes:
table(sce$total_features_by_counts > 350)
##
## FALSE TRUE
## 288 11464
keep_cell_cov <- sce$total_features_by_counts > 350
The third cell filter we aim to apply is based on the percentage of counts of mitochondrial genes. It is expected that poor-quality cells are enriched for the expression of mitochondrial genes, likely because cells underwent apoptosis:
mt_genes_qc <- as.data.frame(colData(sce)) %>%
mutate(exclude = ifelse(pct_counts_MT > 10, TRUE, FALSE)) %>%
ggplot(aes(pct_counts_MT, fill = exclude, color = exclude)) +
geom_histogram(bins = 100, alpha = 0.65) +
geom_vline(xintercept = 10, linetype = "dashed") +
scale_x_continuous("Mitochondrial proportion (%)") +
scale_y_continuous(expand = c(0, 0)) +
scale_color_manual(values = c("black", "red2")) +
scale_fill_manual(values = c("black", "red2")) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
mt_genes_qc
According to the distribution, we remove those cells with a mitochondrial proportion greater than 10:
table(sce$pct_counts_MT < 10)
##
## FALSE TRUE
## 249 11503
keep_mt <- sce$pct_counts_MT < 10
After establishing the threshold for 3 QC metrics: library size, cell coverage and % of mitochondrial genes, we can classify cells as high and low quality. Note that, although there are cells that are outliers in all 3 metrics, we only require a cell to be an outlier in a single metric to be considered as low-quality:
sce$is_high_quality <- keep_lib_size & keep_cell_cov & keep_mt
We aim to assess visually if low-quality cells are indeed outlier cells. To that end, we can run and plot a tSNE:
sce$exclude <- !(sce$is_high_quality)
cell_quality_tsne <- plot_tsne(
sce,
exprs_values = "counts",
color_by = "exclude",
colors = c("gray62", "red2"),
point_size = 1.8,
point_alpha = 0.75
)
# Save tSNE
ggsave(
filename = str_c("results/plots/", date, "_tsne_low_quality_cells.pdf"),
plot = cell_quality_tsne,
device = "pdf",
width = 9,
height = 8
)
cell_quality_tsne
Interestingly, the cells we classified as poor quality cluster together. There are a few cells classified as low-quality in clusters with a great deal of high-quality cells. Thus, if we applied more stringent cutoffs, we would start losing important biological information.
We proceed to filter out poor-quality cells:
table(sce$is_high_quality)
##
## FALSE TRUE
## 414 11338
sce <- sce[, sce$is_high_quality]
sce
## class: SingleCellExperiment
## dim: 20719 11338
## metadata(0):
## assays(1): counts
## rownames(20719): ENSG00000238009 ENSG00000237683 ... ENSG00000215700 ENSG00000215699
## rowData names(11): id name ... total_counts log10_total_counts
## colnames(11338): AAACCTGAGATCACGG-1.1 AAACCTGAGATCTGCT-1 ... TTTGTCAGTTGGTAAA-1 TTTGTCATCAAACAAG-1
## colData names(41): batch donor ... exclude is_high_quality
## reducedDimNames(0):
## spikeNames(1): MT
Gene filtering must be performed right after cell filtering, as some genes may be exclusively expressed in poor-quality cells. The purpose of this step is to remove lowly expressed genes that do not possess enough information for reliable statistical analysis. Furthermore, the discreteness of the counts can affect the reliability of downstream analysis. These genes contain a great deal of dropout events: transcripts that are not detected in the final dataset even though the gene is expressed in the cell.
We will filter genes with a mean expression below a certain cutoff. Again, such cutoff will be data-driven, so let us start by visualizing the distribution of the mean expression:
mean_expr_df <- data.frame(
gene = rownames(sce),
mean_expression = rowMeans(counts(sce))
)
x_titl <- expression("log"[10]*"(mean expression)")
mean_expr_gg <- mean_expr_df %>%
mutate(exclude = ifelse(log10(mean_expression) < -2.25, TRUE, FALSE)) %>%
ggplot(aes(log10(mean_expression), fill = exclude, color = exclude)) +
geom_histogram(bins = 100, alpha = 0.65) +
geom_vline(xintercept = -2.25, color = "red", linetype = "dashed") +
scale_x_continuous(x_titl) +
scale_fill_manual(values = c("black", "red2")) +
scale_color_manual(values = c("black", "red2")) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
write.table(
mean_expr_df,
file = str_c("results/tables/", date, "_mean_gene_expression.tsv"),
sep = "\t",
row.names = FALSE,
col.names = TRUE
)
ggsave(
filename = str_c("results/plots/", date, "_mean_gene_expression.pdf"),
plot = mean_expr_gg,
device = "pdf",
height = 7,
width = 8
)
mean_expr_gg
We see that the distribution is bimodal, with the first peak corresponding to lowly expessed genes. We want our cutoff to fall somewhere between the two peaks, so a mean expression of 10-2.25 = 0.0056 UMI is a good choice:
keep_genes <- log10(mean_expr_df$mean_expression) > -2.25
table(keep_genes)
## keep_genes
## FALSE TRUE
## 10669 10050
sce <- sce[keep_genes, ]
sce
## class: SingleCellExperiment
## dim: 10050 11338
## metadata(0):
## assays(1): counts
## rownames(10050): ENSG00000228463 ENSG00000230368 ... ENSG00000215700 ENSG00000215699
## rowData names(11): id name ... total_counts log10_total_counts
## colnames(11338): AAACCTGAGATCACGG-1.1 AAACCTGAGATCTGCT-1 ... TTTGTCAGTTGGTAAA-1 TTTGTCATCAAACAAG-1
## colData names(41): batch donor ... exclude is_high_quality
## reducedDimNames(0):
## spikeNames(1): MT
In addition, we want to assess which are the highest expressed genes. We expect it to be housekeeping genes, such as actin beta (ACTB).
highest_expr_genes <- plotHighestExprs(sce, feature_names_to_plot = "name")
highest_expr_genes
We want to correct for two biases:
We will use the scran package to compute size factors for the count matrix and correct for the former biases:
sce <- computeSumFactors(sce)
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1193 0.6488 0.9079 1.0000 1.2207 7.4508
sce <- normalize(sce)
We can see that the previous command introduced a new matrix in the “assays” layer of the SingleCellExperiment object, corresponding to the log-normalized expression matrix:
assays(sce)
## List of length 2
## names(2): counts logcounts
logcounts(sce)[1:6, 1:6]
## AAACCTGAGATCACGG-1.1 AAACCTGAGATCTGCT-1 AAACCTGAGGCTAGAC-1 AAACCTGAGGGATCTG-1 AAACCTGAGTCCATAC-1 AAACCTGAGTCGTTTG-1
## ENSG00000228463 0.8113824 0.0000000 0.0000000 0.000000 0 0
## ENSG00000230368 0.0000000 0.0000000 0.0000000 0.000000 0 0
## ENSG00000188976 0.0000000 0.0000000 0.0000000 1.706778 0 0
## ENSG00000188290 0.0000000 0.0000000 0.0000000 0.000000 0 0
## ENSG00000187608 0.0000000 0.9605817 0.7192687 1.706778 0 0
## ENSG00000186891 0.0000000 0.0000000 0.0000000 0.000000 0 0
Interestingly, we see that the size factors correlate almost perfectly with the library size:
plot(sizeFactors(sce) ~ sce$total_counts)
summary(lm(sizeFactors(sce) ~ sce$total_counts))
##
## Call:
## lm(formula = sizeFactors(sce) ~ sce$total_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.49963 -0.10491 0.00341 0.10190 1.80697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.535e-02 3.904e-03 16.74 <2e-16 ***
## sce$total_counts 4.287e-04 1.580e-06 271.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1953 on 11336 degrees of freedom
## Multiple R-squared: 0.8665, Adjusted R-squared: 0.8665
## F-statistic: 7.357e+04 on 1 and 11336 DF, p-value: < 2.2e-16
Let us save all QC plots into a single figure:
cumul_dis <- cumul_dis +
ylab("Cumulative proportion \n of library")
qc_gg_list <- list(
lib_size_qc,
cell_coverage_hist,
mt_genes_qc,
cumul_dis,
cell_quality_tsne,
mean_expr_gg
)
qc_gg <- ggarrange(
plotlist = qc_gg_list,
nrow = 2,
ncol = 3,
common.legend = TRUE,
labels = "auto"
)
ggsave(
plot = qc_gg,
filename = str_c("results/plots/", date, "_quality_control.pdf"),
device = "pdf",
width = 12,
height = 9
)
ggsave(
plot = qc_gg,
filename = str_c("doc/figures/R/", date, "_quality_control.pdf"),
device = "pdf",
width = 19,
height = 14.25,
units = "cm"
)
qc_gg
As we know, we have data from two different donors (male and female) and two different batches (JULIA_03 and JULIA_04). We need to assess whether that is introducing batch effects:
# Find Highly Variable Genes (HVG)
sce_var <- sce
fit_var <- trendVar(sce_var, use.spikes = FALSE)
decomp_var <- decomposeVar(sce_var, fit_var)
top_hvgs <- order(decomp_var$bio, decreasing = TRUE)
top_20_pct_hvgs <- top_hvgs[1:(0.2 * length(top_hvgs))]
sce_var <- sce_var[top_20_pct_hvgs, ]
# Run tSNE
sce_var <- runTSNE(object = sce_var, exprs_values = "logcounts")
# Create data frame and base plot
tsne_df <- reducedDim(sce_var, "TSNE") %>%
as.data.frame() %>%
set_colnames(c("TSNE1", "TSNE2")) %>%
mutate(batch = sce_var$batch, donor = sce$donor)
tsne <- ggplot(tsne_df, aes(TSNE1, TSNE2)) +
geom_point(size = 2) +
theme_bw() +
theme(panel.grid = element_blank())
# Batch effect (batch&donor)
tsne_batch <- tsne +
geom_point(aes(color = batch)) +
scale_color_manual(values = c("#5f74a0", "#c18c69"))
tsne_donor <- tsne +
geom_point(aes(color = donor)) +
scale_color_manual(values = c("#a6599f", "#36c987"))
tsne_batch
tsne_donor
We see a clear batch effect. However as there is an overlapping between “batch” and “donor” (03-male, 04-female), we do not know which one of the two is introducing it. Let us condition one on the other and inspect what is the primary source of variability:
# Batch conditioned on donor
male_batch <- tsne_df %>%
filter(donor == "male") %>%
ggplot(aes(TSNE1, TSNE2, color = batch)) +
geom_point(size = 1.5) +
scale_color_manual(values = c("#5f74a0", "#c18c69")) +
ggtitle("Male") +
theme_bw() +
theme(panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
female_batch <- tsne_df %>%
filter(donor == "female") %>%
ggplot(aes(TSNE1, TSNE2, color = batch)) +
geom_point(size = 1.5) +
scale_color_manual(values = c("#5f74a0", "#c18c69")) +
ggtitle("Female") +
theme_bw() +
theme(panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
ggarrange(plotlist = list(male_batch, female_batch), ncol = 2, nrow = 1)
# Donor conditioned on batch
julia_03_tsne <- tsne_df %>%
filter(batch == "JULIA_03") %>%
ggplot(aes(TSNE1, TSNE2, color = donor)) +
geom_point(size = 1.5) +
scale_color_manual(values = c("#5f74a0", "#c18c69")) +
ggtitle("JULIA_03") +
theme_bw() +
theme(panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
julia_04_tsne <- tsne_df %>%
filter(batch == "JULIA_04") %>%
ggplot(aes(TSNE1, TSNE2, color = donor)) +
geom_point(size = 1.5) +
scale_color_manual(values = c("#5f74a0", "#c18c69")) +
ggtitle("JULIA_04") +
theme_bw() +
theme(panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
ggarrange(plotlist = list(julia_03_tsne, julia_04_tsne), ncol = 2, nrow = 1)
Indeed, we see how both variables introduce variability in the data. In downstream analysis, we will separate male and female and treat them as biological replicates. Furthermore, we will introduce “batch” as a covariate in the analysis.
We have our SCE filtered and normalized. We can now select the columns of interest in the colData and rowData slots, and then save the object as .RDS file to use in future analysis.
colData(sce) <- colData(sce)[, c("batch", "donor", "condition")]
saveRDS(
sce,
file = "results/R_objects/10X_SingleCellExperiment_filt&norm.RDS"
)
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.4.0 dplyr_0.8.0.1 readr_1.3.1 tidyr_0.8.3 tibble_2.1.1 tidyverse_1.2.1 Seurat_2.3.4 cowplot_0.9.4 gridGraphics_0.3-0 gridExtra_2.3 purrr_0.3.2 SC3_1.10.1 scran_1.10.2 scater_1.10.1 SingleCellExperiment_1.4.1 SummarizedExperiment_1.12.0 DelayedArray_0.8.0 BiocParallel_1.16.6 matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 ggpubr_0.2 magrittr_1.5 ggplot2_3.1.0 fitdistrplus_1.0-14 npsurv_0.4-0 lsei_1.2-0 survival_2.44-1.1 MASS_7.3-51.3 pheatmap_1.0.12 kmed_0.2.0 psych_1.8.12 stringr_1.4.0 Matrix_1.2-17 BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] reticulate_1.11.1 R.utils_2.8.0 tidyselect_0.2.5 htmlwidgets_1.3 trimcluster_0.1-2.1 Rtsne_0.15 munsell_0.5.0 codetools_0.2-16 ica_1.0-2 statmod_1.4.30 withr_2.1.2 colorspace_1.4-1 knitr_1.22 rstudioapi_0.10 ROCR_1.0-7 robustbase_0.93-4 dtw_1.20-1 gbRd_0.4-11 labeling_0.3 Rdpack_0.10-1 lars_1.2 GenomeInfoDbData_1.2.0 mnormt_1.5-5 bit64_0.9-7 rhdf5_2.26.2 generics_0.0.2 xfun_0.5 diptest_0.75-7 R6_2.4.0 doParallel_1.0.14 ggbeeswarm_0.6.0 locfit_1.5-9.1 hdf5r_1.0.1 flexmix_2.3-15 bitops_1.0-6 assertthat_0.2.1 promises_1.0.1 SDMTools_1.1-221 scales_1.0.0 nnet_7.3-12 beeswarm_0.2.3 gtable_0.3.0 rlang_0.3.3 splines_3.5.1 lazyeval_0.2.2 acepack_1.4.1 broom_0.5.1
## [48] checkmate_1.9.1 modelr_0.1.4 BiocManager_1.30.4 yaml_2.2.0 reshape2_1.4.3 backports_1.1.3 httpuv_1.5.0 Hmisc_4.2-0 tools_3.5.1 bookdown_0.9 gplots_3.0.1.1 RColorBrewer_1.1-2 proxy_0.4-23 dynamicTreeCut_1.63-1 ggridges_0.5.1 Rcpp_1.0.1 plyr_1.8.4 base64enc_0.1-3 zlibbioc_1.28.0 RCurl_1.95-4.12 rpart_4.1-13 pbapply_1.4-0 viridis_0.5.1 zoo_1.8-5 haven_2.1.0 cluster_2.0.7-1 data.table_1.12.0 lmtest_0.9-36 RANN_2.6.1 mvtnorm_1.0-10 hms_0.4.2 mime_0.6 evaluate_0.13 xtable_1.8-3 readxl_1.3.1 mclust_5.4.3 compiler_3.5.1 KernSmooth_2.23-15 crayon_1.3.4 R.oo_1.22.0 htmltools_0.3.6 segmented_0.5-3.0 pcaPP_1.9-73 later_0.8.0 Formula_1.2-3 snow_0.4-3 rrcov_1.4-7
## [95] lubridate_1.7.4 WriteXLS_4.1.0 fpc_2.1-11.1 cli_1.1.0 R.methodsS3_1.7.1 gdata_2.18.0 metap_1.1 igraph_1.2.4 pkgconfig_2.0.2 registry_0.5-1 foreign_0.8-71 xml2_1.2.0 foreach_1.4.4 vipor_0.4.5 rngtools_1.3.1 pkgmaker_0.27 XVector_0.22.0 rvest_0.3.2 bibtex_0.4.2 doRNG_1.7.1 digest_0.6.18 tsne_0.1-3 cellranger_1.1.0 rmarkdown_1.12 htmlTable_1.13.1 edgeR_3.24.3 DelayedMatrixStats_1.4.0 kernlab_0.9-27 shiny_1.2.0 gtools_3.8.1 modeltools_0.2-22 jsonlite_1.6 nlme_3.1-137 Rhdf5lib_1.4.3 BiocNeighbors_1.0.0 viridisLite_0.3.0 limma_3.38.3 pillar_1.3.1 lattice_0.20-38 httr_1.4.0 DEoptimR_1.0-8 glue_1.3.1 png_0.1-7 prabclus_2.2-7 iterators_1.0.10 bit_1.1-14 mixtools_1.1.0
## [142] class_7.3-15 stringi_1.4.3 HDF5Array_1.10.1 doSNOW_1.0.16 latticeExtra_0.6-28 caTools_1.17.1.2 irlba_2.3.3 e1071_1.7-1 ape_5.3